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Abstract 

The Maximum Likelihood (ML) and Cross VaUdation (CV) methods for esti- 
mating covariance hyper-parameters are compared, in the context of Kriging 
with a misspecified covariance structure. A two-step approach is used. First, 
the case of the estimation of a single variance hyper-parameter is addressed, 
for which the fixed correlation function is misspecified. A predictive variance 
based quality criterion is introduced and a closed-form expression of this crite- 
rion is derived. It is shown that when the correlation function is misspecified, 
the CV does better compared to ML, while ML is optimal when the model is 
well-specified. In the second step, the results of the first step are extended to the 
case when the hyper-parameters of the correlation function are also estimated 
from data. 
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1. Introduction 



Kriging models (Matheron 1970 Stein 1999 Rasmussen and Williams 



|2006^ consist in interpolating the values of a Gaussian random field given ob 
servations at a finite set of observation points. They have become a popular 
method for a large range of applications, such as numerical code approximation 
(j Sacks et al. 1989 Santner et al. 



global optimization (Jones et al. 



2003') and calibration (IPaulo et all 120121) or 



1998| ). 



One of the main issues regarding Kriging is the choice of the covariance 
function. A Kriging model yields an unbiased predictor, with minimal variance 
and a correct predictive variance, if the covariance function used in the model 
coincides with that of the random field from which the observations stem. Sig- 
nificant results concerning the influence of a misspecified covariance function 
on Kriging predictions are available (Stein 1988, ^1990a,cj ). These results are 
based on the fixed-domain asymptotics (Stein 1999 p. 62), that is to say the 
case when the infinite sequence of observation points is dense in a bounded 
domain. In this setting, the results in ( jStein^ |1988; ^1990a||C| ) state that if the 
true and assumed covariance functions yield equivalent Gaussian measures (see 
e.g Stein (1999 p. 110) and Ibragimov and Rozanov (1978 p. 63)), then there is 
asymptotically no loss using the incorrect covariance function. The asymptotic 
optimality holds for both the predictive means and variances. 

Hence, in the fixed-domain asymptotics framework, it is sufficient to esti- 
mate a covariance function equivalent to the true one. Usually, it is assumed 
that the covariance function belongs to a given parametric family. In this case, 
the estimation boils down to estimating the corresponding parameters, that are 
called "hyper-parameters". Because of the above theoretical results, it is useful 
to distinguish between microergodic and non-microergodic hyper-parameters. 
Following the definition in Stein (1999), an hyper-parameter is microergodic if 



two covariance functions are orthogonal whenever they differ for it (as in ^Stein 



(1999), we say that two covariance functions are orthogonal if the two under- 
lying Gaussian measures are orthogonal). Non-microergodic hyper-parameters 
cannot be consistently estimated but have no asymptotic influence on Kriging 
predictions, as shown in Zhang ( 2004 ) in the Matern case. There is a fair amount 
of literature on the consistent estimation of microergodic hyper-parameters, par- 
ticularly using the Maximum Likelihood (ML) method. Concerning the isotropic 
Matern model, with fixed regularity parameter v and free variance and corre- 

2 

lation length ^, the microergodic ratio ^ can be consistently estimated by ML 
for dimension c? < 3 ( Zhang, ^2004) , with asymptotic normality for d 



1 (Du 



et al. 



2009). Both and £ can be consistently estimated for d > 4 (Anderes 
2010). For the multiplicative Matern model with j/ = |, both and the d 
correlation length hyper-parameters £i, are consistently estimated by ML 
for d > 3 (Loh 2005). For the Gaussian model, the d correlation lengths are 

Finally for the multiplica- 



consistently estimated by ML (Loh and Lam 



2000). 



tive exponential model, the microergodic ratio ^ is consistently estimated by 
ML with asymptotic normality for c? = 1 (Ying 1991). All hyper-parameters 
(7^, ^1, £d are consistently estimated by ML with asymptotic normality for 



2 



d > 1 (Ying 1993). 



We believe that the fixed-domain asymptotics does not solve completely the 
issue of the estimation of the covariance function. The first point is that the 
above theoretical results are asymptotic, while one could be interested in finite- 



sample studies, such as the numerical experiments performed in Stein ( 1999 



ch.6.9), and the detailed study of the exponential covariance in Zhang and Zim 



merman 



in 


Stein 


( 


1999 


Zhang and Zim- 



(20051. 



The second point is that one may not be able to asymptotically 
estimate a covariance function equivalent to the true one. Indeed, for instance, 
for two covariance functions of the isotropic Matern class to be equivalent, it 
is necessary that their regularity parameters are equal (see the one-dimensional 
example in Stein (1999 p. 136)). Yet, it is common practice, especially for the 
analysis of computer experiment data, to enforce the regularity parameter to 
an arbitrary value (see e.g Martin and Simpson (2004)). The interplay between 
the misspecification of the regularity parameter and the prediction mean square 
error is not trivial. The numerical experiments in Vazquez (2005, ch.5.3.3) show 
that a misspecified regularity parameter can have dramatic consequences. 

The elements pointed out above justify addressing the case when a para- 
metric estimation is carried out, within a covariance function set, and when the 
true underlying covariance function does not belong to this set. We call this 
the model misspecification case. In this context, we study the Cross Validation 



(CV) estimation method (Sundararajan and Keerthi 2001 Zhang and Wang 



2010[ ), and compare it with ML. This comparison has been an area of active 



research. Concerning theoretical results. Stein ( 1990b ) showed that for the es 
timation of a signal-to-noise ratio parameter of a Brownian motion, CV has 
twice the asymptotic variance of ML in a well-specified case. Several numerical 
results are also available, coming either from Monte Carlo studies as in Santner| 
et al. (2003 ch.3) or deterministic studies as in Martin and Simpson (2004). In 



both the above studies, the interpolated functions are smooth, and the covari- 
ance structures are adapted, being Gaussian in Martin and Simpson ( 2004 1 and 



having a free smoothness parameter in Santner et al. 
We use a two-step approach. 



(2003). 



In the first step, we consider a parametric fam- 
ily of stationary covariance functions in which only the global variance hyper- 
parameter is free. In this framework, we carry out a detailed and quantitative 
finite-sample comparison, using the closed-form expressions for the estimated 
variances for both the ML and CV methods. For the second step we study 
the general case in which the global variance hyper-parameter and the correla- 
tion hyper-parameters are free and estimated from data. We perform extensive 
numerical experiments on analytical functions, with various misspecifications, 
and we compare the Kriging models obtained with the ML and CV estimated 
hyper-parameters. 

The paper is organized as follows. In Section [2j we detail the statistical 
framework for the estimation of a single variance hyper-parameter, we introduce 
an original quality criterion for a variance estimator, and we give a closed- 
form formula of this criterion for a large family of estimators. In Section [3] 
we introduce the ML and Leave-One-Out (LOO) estimators of the variance 
hyper-parameter. In Section|4]we numerically apply the closed-form formulas of 
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Section[2]and we study their dependences with respect to model misspecification 
and number of observation points. We highlight our main result that when the 
correlation model is misspecified, the CV does better compared to ML. Finally 
in Section [5] we illustrate this result on the Ishigami analytical function and 
then generalize it, on the Ishigami and Morris analytical functions, to the case 
where the correlation hyper-parameters are estimated as well. 



2. Framework for the variance hyper-parameter estimation 

We consider a Gaussian process Y, indexed by a set X. Y is stationary, with 
unit variance, and its correlation function is denoted by A Kriging model 
is built for Y, for which it is assumed that Y is centered and that its covariance 
function belongs to the set C, with 

C = {(7^B.2,(y'' , (1) 

with R2{x) a given stationary correlation function. Throughout this paper, 
Ei, vari, covi and ^i, i G {1,2}, denote means, variances, covariances and 
probability laws taken with respect to the distribution of Y with mean zero, 
variance one, and the correlation function i?^. We observe Y on the points 
xi,...,Xn G In this framework, the hyper-parameter is estimated from 
the data y = (j/i, ?/„)* — (F(a;i), F(a;„))* using an estimator ct^. This 
estimation does not affect the Kriging prediction of j/o — ^(^^o): for a new point 
Xo, which we denote by yo: 

yo :=E2(yol2/) =72r^'y, (2) 

where (7^)^- = Ri{xj - xq) and iTi)j^k = Ri{xj - Xk), i € {1,2}, I < j,k < n. 
The conditional mean square error of this non-optimal prediction is, after a 
simple calculation. 

El [(yo - yo)'\y] = iilr^'y - i\T^'vf + 1 - 7?rrVi. (3) 

However, using the covariance family C, we use the classical Kriging predictive 
variance expression ct^c^^ , with 



c:„ := var2{yo\y) = 1 - -fiV^ ^2- (4) 



As we are interested in the accuracy of the predictive variances obtained from 
an estimator ct^, the following notion of Risk can be formulated. 

Definition 2.1. For an estimator of , we call Risk at xq and denote by 
TZs-^,xo quantity 



T^a^.xn = El 



(El [iyo-yor\y]-a'clf 



4 



If l^a^^xo is small, then this means that the predictive variance a^c^ 
correct prediction of the conditional mean square error ([s]) of the prediction ijQ. 
Note that when i?i — R2 the minimizer of the Risk at every xq is = 1. When 
Ri 7^-^2 5 an estimate of different from 1 can improve the predictive variance, 
partly compensating for the correlation function error. 

To complete this section, we give the closed-form expression of the Risk of an 
estimator that can be written as a quadratic form of the observations, which 
is the case for all classical estimators, including the ML and CV estimators of 
Section HI 

Proposition 2.2. Let be an estimator of of the form y'^lsAy with M an 
n X n matrix. Denoting /(A,B) — tr{A)tr{'B) + 2ir(AB), for A, B n x n 
real matrices, Mq = {T^^-f2 - T^^ji){-fi,T^^ - -/{I^Y Mi = MFi, ci = 

1 — 7*rj^^7i and C2 = 1 — 7|r^"'"72, we have: 

71^2,^, = /(Mo,Mo) + 2citr(Mo)-2c2/(Mo,Mi) 
+4 - 2ciC2ir(Mi) + c^/(Mi, Ml). 



It seems difficult at first sight to conclude from Proposition 2.2 whether one 
estimator is better than another given a correlation function error and a set of 
observation points. Therefore, in Section |4j we numerically analyze the Risk for 
the ML and CV estimators of the variance for several designs of experiments. 
Before that, we introduce the ML and CV estimators of a^. 



3. ML and CV estimation of the variance hyper-parameter 



In the framework of Section [2j the ML estimator a 
et all (120031 p.66)) is 

1 



ML 



of a (see e.g Santner 



'ML 



(5) 



Let us now consider the CV estimator of tr^. The principle is that, given 
a value cr^ specifying the covariance function used among the set C, we can, 
for 1 < i < n, compute y^-i := E2(?/i|yi, y^-i, yi+i, ?/„) and cr^cf _,^ := 
o''^var2{yi\yi, y^-i, y^+i, ...,?/„). The Cross Validation estimate of is based 
on the criterion 

1 (y» - Vi-i)'^ 



C 



LOO 



(6) 



It is noted in Cressie ( 1993 p. 102) that if CT^i?2 is a correct estimate of the true 
covariance function, then we should expect this criterion to be close to 1. In 
[Appendix C[ we show that if the observation points expand in a regular way, 
then for most classical correlation functions (the Matern family for example), if 
fT^i?2 is the true covariance function, then ^ converges toward 1 in the mean 
square sense. 

Based on this rigorous result about the case of an infinite regular grid, we 
can conjecture that ^ should be close to 1 in a general context when the value 
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of CT^ is correct. Hence the idea of the CV estimation of tr^ is to seek the value 
of so that this criterion is equal to 1, which yields the estimator 



1 (yi - Vi-if 



1=1 



(7) 



By means of the well-known virtual LOO formulas of Proposition 3.1 (see 
e.g Ripley (1981 ch.5.2)), we obtain the following vector-matrix closed-form 
expression 

n 

with diag(T2^) the matrix obtained by setting to all non-diagonal terms of 
r^"'^. In the supplementary material, we give a short reminder, with proofs, of 
the virtual LOO formulas given in proposition |3.1[ 

Proposition 3.1. For 1 < i < n: 



and ^ 

2 

When Ri — R2, we see that ML is more efficient than CV. Indeed 

7^a2,,„ = El ((1 - 7?rrSi) - ^'(1 - 7?rrVi))' = {i-^lr^jifE^iia'-if), 

(8) 

so that the Risk of definition |2.1| is proportional to the quadratic error in es- 
timating the true cr^ = 1. We calculate Ei((t^^^) = Ei((T^y) = 1, hence both 
es timators arc u nbiased. Concerning their variances, on the one hand we recall 
in . 



of 
hand 



Appendix B that wari((T|,j^) = -, the Cramer-Rao bound for the estimation 
cr^, in the case when the true is 1 and the model for V is C On the other 



n 



r. n n ('T-1^2 



~{ (^1 



1=1 j = 

Hence vari{a1;y) > ^ X)ILi (r-i)^(r-'i)i . ^ h Cramer-Rao bound. 

However vari{aQy) is only upper-bounded by 2 (because T^^ is a covariance 
matrix). Furthermore vari{a'^y) can be arbitrarily close to 2. Indeed, with 

n — 1 n — 1 
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where J is the n x n matrix with all coefficients being 1, we obtain 

..2 ^ 2 2(71-1) £2 

vari[a(jy) = - + 



2 ■ 



n n (e + (n - 1)(1 - e)) 

Hence, when Ri — R2, ML is more efficient to estimate the variance parameter. 
The object of the next section is to study the case i?i 7^ R2 numerically. 



4. Numerical results for the variance hyper-parameter estimation 

All the numerical experiments are carried out with the numerical software 
Scilab (Gomez et al. 1999). We use the Mersenne Twister pseudo random 
number generator of M. Matsumoto and T. Nishimura, which is the default 
pseudo random number generator in Scilab for large-size random simulations. 

4--1- Criteria for comparison 

Pointwise criteria. We define two quantitative criteria that will be used to 
compare the ML and CV assessments of the predictive variance at prediction 
point xq. 

The first criterion is the Risk on Target Ratio (RTR), 



RTR{xo) 



El [(yo - v^y 



(9) 



with cP' being either tr^ 



ML 



or (7, 



From Definition 12. II we obtain 



CV 



RTR{xo) = 



El 



(El [iyo~-yo)'\y]-^^cl 
El [{m - yo?] 



(10) 



m 

tity El [(j/o — yo)'^\y\{the target in the RTR acronym) with the predictor tr^c^^. 
The denominator of ( 10 ) is, by the law of total expectation, the mean of the 
predictand Ei [(yg —yoPly]- Hence, the RTR in ([9]) is a relative prediction 
error, which is easily interpreted. 

We have the following bias-variance decomposition of the Risk, 



The numerator of (|10|) is the mean square error in predicting the random quan- 



El [{yo 



yo) 



El [ 



vari (El [{yo - yof\y] 



'-2 2 \ 



V 



bias 



variance 



(11) 

Hence the second criterion is the Bias on Target Ratio (BTR) and is the relative 
bias 

. |Ei [(yo-yo)^]-Ei {a^cl)\ 

Bl R(xo) - =r77T r^-. ■ (12) 

El [{yo ~ yoVl 
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The following equation summarizes the link between RTR and BTR: 



RTR 
>vrelative error/ 



BTR 
>vrelative bias/ 



vari (El [(yo - ?/o)^|y] 



El [{yo - yoY 



relative variance 



(13) 



Pointwise criteria when Ri = i?2- When i?i = i?2, Ei [(yo ~ 2/o)^|2/] does 
not depend on y. Therefore, the RTR and BTR simplify into RTR{xq) = 
\/Ei [(ct2 _ 1)2] and BTR{xo) = |1 - Ei(<t2)|. Hence, the RTR and BTR are 
the mean square error and the bias in the estimation of the true variance cr^ = 1, 
and RTR"^ = BTR^ + vari{a'^). 

Integrated criteria. In this paragraph, we define the two integrated versions of 
RTR and BTR over the prediction space X . Assume X is equipped with a 
probability measure ji. Then we define 



IRTR=J RTR^{xo)dfi{xo) 



(14) 



and 



IBTR=J / BTi?2(xo)d/^(xo). 



Hence we have the equivalent of ([T3| for IRTR and IBTR: 

vari (El [(yo - yo)2|y] - a^clj 



IRTR-" = IBTR^ 



X 



El [(yo - yo) 



dfi{xo). 



(15) 



(16) 



4-2. Designs of experiments studied 

We consider three different kinds of Designs Of Experiments (DOEs) of n 
observation points on the prediction space X = [0, 1]''. 

The first DOE is the Simple Random Sampling (SRS) design and consists 
of n independent observation points with uniform distributions on [0, 1]'^. This 
design may not be optimal for a Kriging prediction point of view, as it is likely 
to contain relatively large areas without observation points. However it is a 
convenient design for the estimation of covariance hyper-parameters because it 



may contain some points with small spacing. It is noted in Stein ( 1999 ch.6.9) 



that such points can dramatically improve the estimation of the covariance 
hyper-parameters. 

The second DOE is the Latin Hypercube Sampling Maximin (LHS-Maximin) 
design (see e.glSantner et al. (2003)). This design is one of the most widespread 



non-iterative designs in Kriging. To generate a LHS-Maximin design, we gener- 
ate 1000 LHS designs, and keep the one that maximizes the min criterion (i.e. 
the minimum distance between two different observation points). Let us notice 
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that this is the method used by the Matlab function Ihsdesignf..., 'maximin',k) 
which generates k LHS designs with default k = 5. 

The third DOE is a deterministic sparse regular grid. It is built according 



to the Smolyak ( 1963 ) sparse tensorization construction of the one-dimensional 
regular grid G = '^^^} of level I. 

The three DOEs are representative of the classical DOEs that can be used 
for interpolation of functions, going from the most irregular ones (SRS) to the 
most regular ones (sparse grid). 

4.3. Families of correlation functions studied 

We consider two classical families of stationary correlation functions. 

• The power-exponential correlation function family, parameterized by the 
vector of correlation lengths £ = {£1, ■■■,£d) and the power p. R is power- 
exponential {£,p) when 



i?(/.)=exp -V^ . (17) 




• The Matern correlation function family, parameterized by the vector of 
correlation lengths £ = {£i,...,£d) and the regularity parameter v. R is 
Matern {I, v) when 

^^^^ ^ r>^ (27^/1!,)" if. {2^\h\,) , (18) 



with \ h\i — \/j2i=i r Gamma function and K^, the modified Bessel 



function of second order. See e.g Stein (1999 p. 31) for a presentation of 
the Matern correlation function. 

4.4- Influence of the model error 

We study the influence of the model error, i.e. the difference between i?i 
and i?2- For different pairs i?i,i?2, we generate rip = 50 SRS and LHS learning 



samples, and the deterministic sparse grid (see Section 4.2). We compare the 
empirical means of the two integrated criteria IRTR and IBTR (see Section 4.1 1 
for the different DOEs and for ML and CV. IRTR and IBTR are calculated on 
a large test sample of size 5000. We take n ~ 70 for the learning sample size 
(actually n = 71 for the regular grid) and d = 5 for the dimension. 

For the pairs i?i,i?2, we consider the three following cases. First, i?i is 
power-exponential ((1.2, 1.2), 1.5) and R2 is power-exponential ((1.2, 1.2), P2) 
with varying p2- Second, Ri is Matern ((1.2, 1.2), 1.5) and R2 is Matern 
((1.2, 1.2),i/2) with varying 1^2- Finally, Ri is Matern ((1.2, 1.2), 1.5) and 
R2 is Matern ((^2, ^2), 1-5) with varying £2- 

On Figure [1] we plot the results for the SRS DOE. We clearly see that when 
the model error becomes large, CV becomes more efficient than ML in the sense 
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Figure 1: Influence of the model error for the SRS DOE (see Section [4!2| . Plot of the IRTR and 
IBTR integrated criteria (see Section [4.1[ | for ML and CV. Left: power-exponential correlation 
function with error on the exponent, the true exponent is pi = 1.5 and the model exponent p2 
varies in [1.2, 1.9]. Middle: Matern correlation function with error on the regularity parameter, 
the true regularity parameter is ui = 1.5 and the model regularity parameter 1/2 varies in 
[0.5,2.5]. Right: Matern correlation function with error on the correlation length, the true 
correlation length is £1 = 1.2 and the model correlation length £2 varies in [0.6,1.8]. ML is 
optimal when there is no model error while CV is more robust to model misspecifications. 




Figure 2: Same setting as in Figure[T] but with the LHS-Maximin DOE (see Section [X2| l. ML 
is optimal when there is no model error while CV is more robust to model misspecifications. 



of IRTR. Looking at (16), one can see that tlic IRTR is composed of IBTR 
and of an integrated relative variance term. When R2 becomes different from 
i?i , the IBTR contribution increases faster than the integrated relative variance 
contribution, especially for ML. Hence, the main reason why CV is more robust 
than ML to model misspecification is that its bias increases more slowly with 
the model misspecification. 

On Figure [2] we plot the results for the LHS-Maximin DOE. The results 
are similar to these of the SRS DOE. They also appear to be slightly more 
pronounced, the IRTR of CV being smaller than the IRTR of ML for a smaller 
model error. 

On Figure [3j we plot the results for the regular grid DOE. The results 
are radically different from the ones obtained with the SRS and LHS-Maximin 
designs. The first comment is that the assessment of predictive variances is much 
more difficult in case of model misspecification (compare the min, between ML 
and CV, of IRTR for the SRS and LHS-Maximin designs against the regular 
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Figure 3: Same setting as in Figure[T]but with the regular sparse grid DOE (see Section |4.2[ |. 
The results are radically different from the ones obtained with the SRS and LHS-Maximin 
DOEs. This time CV is less robust to misspecifications of the correlation function. 



grid) . This is especially true for misspecifications on the exponent for the power- 
exponential correlation function and on the regularity parameter for the Matern 
function. The second comment is that this time CV appears to be less robust 
than ML to model misspecification. In particular, its bias increases faster than 
ML bias with model misspecification and can be very large. Indeed, having 
observation points that are on a regular grid, CV estimates a hyper-parameter 
adapted only to predictions on the regular grid. Because of the correlation 
function misspecification, this does not generalize at all to predictions outside 
the regular grid. Hence, CV is efficient to assess predictive variances at the 
points of the regular grid but not to assess predictive variances outside the 
regular grid. This is less accentuated for ML because ML estimates a general- 
purpose and not a for the purpose of assessing predictive variances at 
particular points. Furthermore, it is noted in looss et al. (2010) that removing 
a point from a highly structured DOE breaks its structure, which may yield 
overpessimistic CV results. 

We conclude from these numerical results that, for the SRS and LHS-Maximin 
designs of experiments, CV is more robust to model misspecification. It is the 
contrary for the regular grid, for the structural reasons presented above. This 
being said, we do not consider the regular grid anymore in the following numer- 
ical results and only consider the SRS and LHS-Maximin designs. Let us finally 
notice that the regular grid is not particularly a Kriging-oriented DOE. Indeed, 
for instance, for n = 71, it remains only 17 distinct points when projecting on 
the first two base vectors. 



4-. 5. Influence of the number of points 

Using the same procedure as in Section [4.4[ we still set d = 5 and we vary 
the learning sample size n. The pair i?i,i?2 is fixed in the three following dif- 
ferent cases. First, i?i is power-exponential ((1.2, 1.2), 1.5) and i?2 is power- 
exponential ((1.2, 1.2), 1.7). Second, i?i is Matern ((1.2, 1.2), 1.5) and i?2 
is Matern ((1.2, 1.2), 1.8). Finally, i?i is Matern ((1.2, 1.2), 1.5) and R2 is 
Matern ((1.8, 1.8), 1.5). This time, we do not consider integrated quantities 
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Figure 4: Influence of the number n of observation points for the SRS DOE (see Section 
|4.2^ . Plot of the RTR and BTR criteria (see Section |4.1| l for prediction at the center of the 
domain and for ML and CV. Left: power-exponential correlation function with error on the 
exponent, the true exponent is pi = 1.5 and the model exponent is p2 = 1.7. Middle: Matern 
correlation function with error on the regularity parameter, the true regularity parameter is 
ui = 1.5 and the model regularity parameter is U2 = 1-8. Right: Matern correlation function 
(u = |) with error on the correlation length, the true correlation length is ii = 1.2 and the 
model correlation length is £2 = 1-8. 



of interest and focus on the prediction on the point xq having all its components 
set to I (center of domain). 

On Figure |4] we plot the results for the SRS DOE. The first comment is that, 
as n increases, the BTR does not vanish, but seems to reach a limit value. This 
limit value is smaller for CV for the three pairs i?2. Recalling from ( 13 ) that 
RTR is the sum of BTR and of a relative variance term, we observe that this 
relative variance term decreases and seems to vanish when n increases (because 
BTR becomes closer to RTR). The decrease is much slower for the error on 
the correlation length than for the two other errors on the correlation function. 
Furthermore, the relative variance term decreases more slowly for CV than for 
ML. Finally, because CV is better than ML for the BTR and worse than ML for 
the relative variance, and because the contribution of BTR to RTR increases 
with n, the ratio of the RTR of ML over the RTR of CV increases with n. This 
ratio can be smaller than 1 for very small n and eventually becomes larger than 
1 as n increases (meaning that CV does better than ML). 

On Figure [5] we plot the results for the LHS-Maximin DOE. The results are 
similar to these of the SRS DOE. The RTR of CV is smaller than the RTR of 
ML for a slightly smaller n. This confirms the results of Section 4.4 where the 
model error for which the IRTR of ML reaches the IRTR of CV is smaller for 
LHS-Maximin than for SRS. 



12 



0.5^ 
0.4< 

0.2- 
0.1- 



VRTRMI 



0.5 
0.4 

: 

0.2 



20 40 60 80 100 120 140 160 180 200 



OBTRCV 

¥RTRMI 
RTRC\; 



** ••••••», 

^000000<>00000<''>00< 



20 40 60 80 100 120 140 160 180 200 



0.4 
, 0.3 

0.2: 
0.1: 

0.0 



O BTR CV 

SRTRMl 



° " ^ o ,> o » , 
100 200 300 400 500 600 



8 h 



Figure 5; Same setting as in Figure|4] but with the LHS-Maximin DOE (see Section |4.2| 



5. Study on analytical functions 

In this section, we compare ML and CV on analytical functions, instead 
of realizations of Gaussian processes, as was the case in Section |4] The first 
goal is to illustrate the results of Section |4] on the estimation of the variance 
hyper-parameter. Indeed, the study of Section [4] is more related to the theory 
of Kriging (we work on Gaussian processes) while this section is more related to 
the application of Kriging (modeling of deterministic functions as realizations 
of Gaussian processes) . The second goal is to generalize Section |4] to the case 
where correlation hyper-parameters are estimated from data. 

5.1. ML and CV estimations of covariance hyper-parameters 

We consider a set of observations {xi,yi), {xmPn) as in Section [2] and 
the family {a^Rg,a^ > 0,9 G 0} of stationary covariance functions, with Rg 
a stationary correlation function, and 8 a finite-dimensional set. We denote 
by Eg and varg the means and variances with respect to the distribution of 
a stationary Gaussian process with mean zero, variance one and correlation 
function Eg. 

The ML estimate of (o'^,6') is 9hil G argmiuggQ |re|^/"(7j^^(r0) (see e.g 



Marrel et al. (20081), with Tg the correlation matrix of the training sample with 



correlation function Rg, and af^j^iTg) as in &fi]^{Tg_^^^) is the estimate of 
For CV we choose to use the following estimator for the hyper-parameter 9: 



9cv e argmin V'(yj - yifi{y_i)Y 
066 ~: 



(19) 



with yj_e(?/_i) = ¥.g{yi\yi, ...,yi_i,yi+i, ...,yn). This is the Leave-One-Out mean 
square error criterion that is used, for instance, in Santner et al. (2003) when 



the ML and CV estimations of 9 are compared. Given 9cv and hence T^^^ , a 
is estimated by ([t]). 
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After el emen tary transformations for the estimator 9ml and the use of 



Proposition 



3.1 



for the estimator 



the functions to minimize are 



and 



fMLiO) = - \ogdet{rg) + log (y^r-iy) 



fcv{0)=y'T-'diag{r, 



(20) 



(21) 



In the supplementary material, we recall the closed-form expressions of the gra- 
dients of /a/l and fcv j as functions of the first-order derivatives of the correla- 
tion function. These expressions are available in the literature, see e.g |Mardia| 
and Marshall] (|1984[) for ML and|Sundararajan and Keerthi| ([20M]) for CV. The 



evaluations of the two functions and their gradients have similar computational 
complexities of the order of 0{n^). 

Once we have the closed-form expressions of the gradients at hand, our opti- 
mization procedure is based on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) 
quasi-Newton optimization method, implemented in the Scilab function optim. 
Since the functions Jml and fcv may have multiple local minima, the BFGS 
method is run several times, by taking the initial points in a LHS-Maximin 



design. The presence of multiple local minima is discussed e.g in Martin and 



Simpson (2004). An important point is that, when is a correlation length, we 
recommend to use its logarithm to run the optimization. Indeed a correlation 
length acts as a multiplier in the correlation, so that using its log ensures that 
a given perturbation has the same importance, whether applied to a large or a 
small correlation length. Furthermore, when one wants to explore uniformly the 
space of correlation lengths, as is the case with a LHS design, using directly the 
correlation lengths may give too much emphasis on large correlation lengths, 
which is avoided by using their log. 

Another important issue is the numerical inversion of the correlation matrix. 
This issue is even more significant when the correlation matrix is ill-conditioned, 
which happens when the correlation function is smooth (Gaussian or Matern 
with a large regularity parameter). To tackle this issue we recommend to use 
the Nugget effect. More specifically, for a given correlation matrix Tg, we actu- 
ally compute Tg +T^In, with = 10~^ in our simulations. A detailed analysis 
of the influence of the Nugget effect on the hyper-parameter estimation and 



on the Kriging prediction is carried out in Andrianakis and Challenor (2012) 



However, for the CV estimation of 9, when the correlation function belongs to 
the Gaussian family, or the Matern family with large regularity parameter, an- 
other structural problem appears. For &-^y very large, as the overall predictive 
variance term iTpy(l — ^gTg^^g) has the same order of magnitude as the ob- 
servations, the term 1 — 'jgT'^^^g is very small. Hence, a fixed numerical error 
on the inversion of Tg, however small it is, may cause the term 1 — jgTg^^g 
to be negative. This is what we observe for the CV case when fitting e.g the 
correlation lengths of a Gaussian correlation function. The heuristic scheme 
is that large correlation lengths are estimated, which yields large o-^y, which 
yields small (1 — ^gTg^^g), so possibly negative ones. Notice however that the 
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relative errors of the Kriging prediction terms jgTg^y are correct. It is noted in 



Martin and Simpson (2004 p. 7) that CV may overestimate correlation lengths. 
Hence, to have appropriate predictive variances, one has to ensure that the es- 
timated correlation lengths are not too large. Two possible solutions are to 
penalize either too large correlation lengths or too large d'^y in the minimiza- 
tion of fcv- We choose here the second solution because our experience is that 
the ideal penalty on the correlation lengths, both ensuring reliable predictive 
variance computation and having a minimal effect on the 9 estimation, depends 
on the DOE substantially. In practice, we use a penalty for ir^y starting at 1000 
times the empirical variance ^ify- This penalty is needed only for CV when 
the correlation function is Gaussian or Matern with free regularity parameter. 

5. 2. Procedure 

We consider a deterministic function / on [0, 1]''. We generate rip — 100 
LHS-Maximin training samples of the form Xa,i, f{xa^i), ■■■TXa,m f{xa,n)- We 
denote ya^i — f{xa.i)- From each training sample, we estimate and with 
the ML and CV methods presented in Section [O] 

We are interested in two criteria on a large Monte Carlo test sample i, /(x^ i), X(_„j , f{xt^nt) 
on [0,1]'' (nt = 10000). We denote yt^ = f{xta)-, yt.iiVa) = ^e^ytAVa) and 
a'^c'^i{ya) = a'^varg{ytAya), where comes from either the ML or CV method. 

The first criterion is the Mean Square Error (MSE) and evaluates the pre- 
diction capability of the estimated correlation function Rf. 



1 "* 

-E(2^M-yt.(2/a))'. (22) 



The second criterion is the Predictive Variance Adequation (PVA): 



log 



nt ^ cr^cl.,{ya) 



(23) 



This criterion evaluates the quality of the predictive variances given by the es- 
timated covariance hyper-parameters The smaller the PVA is, the better 
it is because the predictive variances are globally of the same order than the 
prediction errors, so that the confidence intervals are reliable. We use the loga- 
rithm in order to give the same weight to relative overestimation and to relative 
underestimation of the prediction errors. 

We finally average the two criteria over the rip training samples. 

5.3. Analytical functions studied 

We study the two following analytical functions. The first one, for d = 3, is 
the Ishigami function: 

/(xi, X2, X3) = sin(— 7r-t-27ra;i)-|-7sin((— 7r+27rx2))^+0.1(— 7r+27ra;3)'* sin(— 7r+27r2;i). 

(24) 
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The second one, for c? = 10, is a simphfied version of the Morris function 
(|Morris[[l99l|), 



with Wi{x) 



/(a;) = ^ + ^ Wi{x)w-i{x) + ^ Wi{x)wj{x)wk{x) 

1=1 l<i<j<6 l<i<j<fe<5 

+Wi {x)w2 {x)w3 {x)'W4 (x) , 

'2(3^-0-5), if* = 3,5,7 
^2{xi — 0.5) otherwise. 



5.4- Results with enforced correlation lengths 

We work with the Ishigami function, with n — 100 observation points. For 
the correlation function family, we study the exponential and Gaussian families 
(power-exponential family of (17) with enforced p = 1 for exponential and p = 2 
for Gaussian). 

For each of these two correlation models, we enforce three vectors £ of cor- 
relation lengths for R: an arbitrary isotropic correlation length, a well-chosen 
isotropic correlation length and three well-chosen correlation lengths along the 
three dimensions. To obtain a well-chosen isotropic correlation length, we gen- 
erate rip = 100 LHS-Maximin DOEs, for which we estimate the correlation 



length by ML and CV as in Section |5.1[ Wc calculate each time the MSE on 
a test sample of size 10000 and the well-chosen correlation length is the one 
with the smallest MSE among the 2np estimated correlation lengths. The three 
well-chosen correlation lengths are obtained similarly. The three vectors of cor- 
relation lengths yield an increasing prediction quality. 

The results are presented in Table [T] The first comment is that, comparing 
line 1, 2 and 3 against line 4, 5 and 6, the Gaussian family is more appropriate 
than the exponential one for the Ishigami function. Indeed, it yields the smallest 
MSE among the cases when one uses three different correlation lengths, and the 
PVA is quite small as well. This could be anticipated since the Ishigami func- 
tion is smooth, so a Gaussian correlation model (smooth trajectories) is more 
adapted than an exponential one (rough trajectories). The second comment is 
that the benefit obtained by replacing an isotropic correlation length by differ- 
ent correlation lengths is smaller for the exponential class than for the Gaussian 
one. Finally, we see that CV yields much smaller PVAs than ML in line 1, 2, 3 
and 4, in the cases when the correlation function is not appropriate. For line 6, 
which is the most appropriate correlation function, ML yields a PVA compara- 
ble to CV and for line 5, ML PVA is smaller than CV PVA. All these comments 
are in agreement with the main result of Section |4j The ML estimation of cr^ is 
more appropriate when the correlation function is well-specified while the CV 
estimation is more appropriate when the correlation function is misspecified. 
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Correlation model 


Enforced hyper-parameters 


MSE 




PVA 




exponential 


[1,1,1] 


2.01 


ML 


0.50 CV 


0.20 


exponential 


[1.3,1.3,1.3] 


1.94 


ML 


0.46 CV 


0.23 


exponential 


[1.20,5.03,2.60] 


1.70 


ML 


0.54 CV 


0.19 


Gaussian 


[0.5,0.5,0.5] 


4.19 


ML 


0.98 CV 


0.35 


Gaussian 


[0.31,0.31,0.31] 


2.03 


ML 


0.16 CV 


0.23 


Gaussian 


[0.38,0.32,0.42] 


1.32 


ML 


0.28 CV 


0.29 



Table 1: Mean of the MSE and PVA criteria for the Ishigami function for different fixed 
correlation models. The MSE is the same between ML and CV as the same correlation 
function is used. When the correlation model is misspecified, the MSE is large and CV does 
better than ML for the PVA criterion. 



5.5. Results with estimated correlation lengths 

We work with the Ishigami and Morris functions, with n = 100 observation 
points. We use the exponential and Gaussian models as in Section [5.4[ as well 



as the Matern model of (18 1. We distinguish two subcases for the vector i of 
correlation lengths. In Case i we estimate a single isotropic correlation length, 
while in Case a we estimate d correlation lengths for the d dimensions. 

In Table [2] we present the results. For both the Ishigami and Morris func- 
tions, the Gaussian model yields smaller MSEs than the exponential model. 
Indeed, both functions are smooth. Over the different DOEs, we observe that 
the estimated Matern regularity hyper-parameters are large, so that the MSEs 
and the PVAs for the Matern model are similar to the ones of the Gaussian 
model. Let us notice that for the Ishigami function, the relatively large number 
n — 100 of observation points is required for the Gaussian correlation model 
to be more adapted than the exponential one. Indeed, in Table |3j we show 
the same results with n = 70 where the Gaussian model yields relatively larger 
MSEs and substantially larger PVAs. Our interpretation is that the linear in- 
terpolation that the exponential correlation function yields can be sufficient, 
even for a smooth function, if there are not enough observation points. We also 
notice that, generally, estimating different correlation lengths (Case a) yields a 
smaller MSE than estimating one isotropic correlation length (Case i). In our 
simulations this is always true except for the Ishigami function with the expo- 
nential model. Indeed, we see in Table [l] that we get a relatively small benefit 
for the Ishigami function from using different correlation lengths. Here, this 
benefit is compensated by an error in the estimation of the 3 correlation lengths 
with n = 100 observation points. The overall conclusion is that the Gaussian 
and Matern correlation models are more adapted than the exponential one, and 
that using different correlation lengths is more adapted than using an isotropic 
one, provided that there are enough data to estimate these correlation lengths. 

In the exponential case, for both Cases i and a, CV always yields a smaller 
PVA than ML and yields a MSE that is smaller or similar. In Case a, for the 
Gaussian and Matern correlation functions, the most adapted ones, ML always 
yields MSEs and PVAs smaller than CV or similar. Furthermore, for the Morris 
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Function Correlation model MSB PVA 



Ishigami 


exponential Case i 


ML 


-1 
1 


99 


C V 


-1 

1 


9( 


ML: 





35 


CV: 





16 


Ishigami 


exponential Case a 


ML 


2 


01 


CV 


1 


77 


ML: 





36 


CV: 





24 


Ishigami 


Gaussian Case i 


ML 


2 


06 


CV 


2 


11 


ML: 





18 


CV: 





22 


Ishigami 


Gaussian Case a 


ML 


1 


50 


CV 


1 


53 


ML: 





53 


CV: 





50 


Ishigami 


Matern Case i 


ML 


2 


19 


CV 


2 


29 


ML: 





18 


CV: 





23 


Ishigami 


Matern Case a 


ML 


1 


69 


CV 


1 


67 


ML: 





38 


CV: 





41 


Morris 


exponential Case i 


ML 


3 


07 


CV 


2 


99 


ML: 





31 


CV: 





24 


Morris 


exponential Case a 


ML 


2 


03 


CV 


1 


99 


ML: 





29 


CV: 





21 


Morris 


Gaussian Case i 


ML 


1 


33 


CV 


1 


36 


ML: 





26 


CV: 





26 


Morris 


Gaussian Case a 


ML 





86 


CV 


1 


21 


ML: 





79 


CV: 


1 


56 


Morris 


Matern Case i 


ML 


1 


26 


CV 


1 


28 


ML: 





24 


CV: 





25 


Morris 


Matern Case a 


ML 





75 


CV 


1 


06 


ML: 





65 


CV: 


1 


43 



Table 2: n = 100 observation points. Mean of the MSE and PVA criteria over n-p = 100 
LHS-Maximin DOEs for the Ishigami (d = 3) and Morris [d = 10) functions for different fixed 
correlation models. When the model is misspecified, the MSE is large and the CV does better 
compared to ML for the MSE and PVA criterion. 



Function 


Correlation model 




MSE 






PVA 


Ishigami 
Ishigami 


exponential Case a 
Gaussian Case a 


ML: 
ML: 


3.23 CV: 
3.15 CV: 


2.91 
4.13 


ML: 
ML: 


0.27 CV: 0.26 
0.72 CV: 0.76 



Table 3: n = 70 observation points. Mean of the MSE and PVA criteria over np = 100 LHS- 
Maximin DOEs for the Ishigami (d = 3) and Morris (d = 10) functions for the exponential 
correlation model. Contrary to the case n = 100 of Table [2] the Gaussian correlation model 
docs not yield smaller MSEs than the exponential one. 



function with Matern and Gaussian correlation functions, going from Case i to 
Case a enhances the advantage of ML over CV. 

From the discussion above, we conclude that the numerical experiments yield 
results, for the deterministic functions considered here, that are in agreement 
with the conclusion of Section |4} ML is optimal for the best adapted correlation 
models, while CV is more robust to cases of model misspecification. 



5. 6. Case of universal Kriging 

So far, the case of simple Kriging has been considered, for which the un- 
derlying Gaussian process is considered centered. The case of universal Kriging 
can equally be studied, for which this process is considered to have a mean at 
location x of the form X^iLi Pi9i{^)i with known functions gi and unknown co- 
efficients Pi. For instance a closed-form formula similar to that of Proposition 
2.2 can be obtained in the same fashion, and virtual LOO formulas are also 
available (Dubrule 19831. We have chosen to focus on the simple Kriging case 



because we are able to address as precisely as possible the issue of the covari- 
ance function class misspecification, the Kriging model depending only on the 
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Function Mean function model Correlation model MSE 



PVA 



Ishigami constant exponential Case a ML: 1.96 CV: 1.74 ML: 0.39 CV: 0.24 

Ishigami affine exponential Case a ML: 1.98 CV: 1.75 ML: 0.40 CV: 0.24 

Ishigami constant Gaussian Case a ML: 1.54 CV: 1.63 ML: 0.54 CV: 0.54 

Ishigami affine Gaussian Case a ML: 1.58 CV: 1.78 ML: 0.57 CV: 0.57 



Table 4: n = 100 observation points. Mean of the MSE and PVA criteria over rip = 100 
LHS-Maximin DOEs for the Ishigami {d = 3) function and the exponential and Gaussian 
correlation models. The incorporation of the mean function does not change the conclusions 
of Table H 



covariance function choice. Furthermore it is shown in Stein (1999 p. 138) that 
the issue of the mean function choice for the Kriging model is much less crucial 
than that of the covariance function choice. 

Nevertheless, in Table |4] we study, for the Ishigami function, the influence of 
using a universal Kriging model with either a constant or affine mean function. 
The process is the same as for Table [2j We first see that using a non-zero mean 
does not improve significantly the Kriging model. It is possible to observe a 
slight improvement only with the exponential covariance structure, which we 
can interpret because a smooth mean function makes the Kriging model more 
adapted to the smooth Ishigami function. On the contrary, for the Gaussian 
covariance structure, the mean function over-parameterizes the Kriging model 
and slightly damages its performances. Let us also notice that CV appears 
to be more sensitive to this over-parameterization, its MSE increasing with the 
complexity of the mean function. This can be observed similarly in the numerical 
experiments in Martin and Simpson (2004). The second overall conclusion is 
that the main finding of Section |4] and of Table [2] is confirmed: CV has smaller 
MSE and PVA for the misspecified exponential structure, while ML is optimal 
for the Gaussian covariance structure which is the most adapted and yields the 
smallest MSE. 



6. Conclusion 

In this paper, we have carried out a detailed analysis of ML and CV for the 
estimation of the covariance hyper-parameters of a Gaussian process, with a 
misspecified parametric family of covariance functions. This analysis has been 
carried out using a two-step approach. We have first studied the estimation of a 
global variance hyper-parameter, for which the correlation function is misspeci- 
fied. In this framework, we can control precisely the degree of model misspecifi- 
cation and we obtain closed-form expressions for the mismatch indices that we 
have introduced. We conclude from the numerical study of these formulas that 
when the model is misspecified, CV performs better than ML. Second, we have 
studied the general case when the correlation hyper-parameters are estimated 
from data via numerical experiments on analytical functions. We confirm the 
results of the first step, and generalize them. 
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Because CV is more robust to model misspecification, it is less likely than 
ML to yield substantially incorrect predictions. However, ML is more likely 
to yield the best predictions, under the condition that the correlation function 
family is ideally specified. Hence, CV is a suitable method for problems when 
one gives priority to robustness over best possible performance. Studying data- 
driven methods, to choose between ML and CV according to a specific objective, 
may motivate further research. 

We also pointed out in Section 4.4 that some DOEs that are too regular are 
not suitable for CV. Investigating this further and studying numerical criteria 
for quantifying the suitability of a DOE for CV could motivate further research. 
One could be also interested in CV-oriented adaptative improvement of a DOE. 
Finally, we emphasized the need, in some cases, for a penalization of large 
correlation lengths for CV. We enforced this penalization via a penalization of 
too large estimated global variance hyper-parameters. The motivation here was 
purely numerical, but there could be a statistical motivation for doing so. The 
statistical analysis of penalization methods, similar to the one proposed here, for 
CV may motivate further research. In the ML case, there exist penalizations of 
correlation hyper-parameters that both reduce the small sample variance, and 
recover the asymptotic distribution of the non-penalized case, as shown in O] 



and Sudjianto (2005). 
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Appendix A. Proof of Proposition 2.2 

Using the definition of the Risk, the expression of <t2, (U and (|4|, we get: 

7^*^.0 = Ei[(7jrr^2/-72r2-^y)' + i-7lrrSi 

-y*My(l-7^r2-S)]' 

= El [z/*(r2-S2 - rrSi)(72r2"' - 7?rr')y 
+1 - 7jrrSi - y*M2/(i - 72r2-V2)] 



Then, writing y — T'lz with z A/'(0, I„), we get: 



with 
and 



7e^2^,„ = El (z*Mo2 + ci - c2z*Miz)' , 



Mo = rf (r2-V2 - rrSi)(72r2-^ - 7jrr')rf 



Ml = Tj^Mrf 



(A.l) 



To compute this expression, we use the following lemma. The proof relies only 
on 4**^ moment calculation for Gaussian variables and is omitted. 

Lemma Appendix A.l. Let z ^i A/'(0, !„), and A and B be n x n real 
symmetric matrices. Then: 

Ei(z*Azz*Bz) = /(A,B). 



Using the lemma and expanding ( A.l ) yields 



= /(Mo,Mo) + 2citr(Mo)~2c2/(Mo,Mi 



(A.2) 



+cl - 2ciC2tr(Mi) + c2/(Mi, Ml). 
Finally, based on tr(AB) = ir(BA), we can replace Mq and Mi by Mq and 



Ml in (A.2), which completes the proof. 



Appendix B. On the optimality of <t^j_ when i?i = R2 

Here, we consider the case R2 = i?i . We first recall the Cramer-Rao inequal- 
ity, which states that when ct^ = 1, for an unbiased estimator of a^: 



_d_ 



{\nL{y,a'))^,^^ 
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with, L{y,a^) cx ^ exp ^— ^ ^^^2 the UkeUhood of the observations. We 



then calculate the Cramer-Rao bound: 





(- 




r 9 








H 


2 




7 

n 





where we used Lemma 



.Appendix A.l with A = B = I„ to showEi {{y*T-^ y)^) = 
+ 2n. Hence, the Cramcr-Rao bound of the statistical model C is - when 



Finally, varii&lj^) = varii^y^Tj^^y) = ^vari{J2"=i 
r^^y^i AA(0,I„). 



Wl 



th 



Appendix C. Convergence of (|6| 

Proposition Appendix C.l. Assume X = M."^ and that the observation points 
are distinct and located on the infinite regular grid {{iid, id5), i^) S Z'^} 

for fixed S > 0. When a^R2 is the covariance function of the Gaussian process 
Y , is summable on the regular grid, and has a strictly positive continuous Fourier 
transform, then the criterion Cloo has mean one and converges in mean square 
to one as n ^ +00. 



Proof. Consider, for simplicity, = 1. Using the formulas of Proposition 3.1 
_ 1 — 
and introducing z — T2'^ y ^2 A/'(0,I„) yields: 

Cloo ^ 1 ^ fc^l^ ^ l.^r"' [d^ag{T^^)]-' T-^, 
n Cj n 

with diag{T2^) the matrix obtained by setting to all non-diagonal terms of 
r^^- Then, 

MCloo) = Itr (r-' [d^agiT2^)]-' r"^) = 1 f: tSt" = 



Furthermore, 

var^iCLOo) = ^tr (t^' [dzagiT^')] r^'^ [diagiT^')] . (C.l) 
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Then, with Xmin(X) ^'^d Xmax{^) the smahest and largest eigenvalues of a 
symmetric, strictly positive, matrix F, 

Xr„ax{[diag{T^'^)] ^) ^ max } < ]—Y\ = X,nax{T2)- (C.2) 



Hence, from (C.l) and (C.2), 



2 / Amirt(r2) 



< var2{CLOo) < 



2 ( AmarE(r2) 



n \ \min{^2) , 

Hence < infX„iin{^2) < supXmax(r2) < +oo implies the proposition. 

n n 

First, for all n, Xmaxi^2) is smaller than the largest norm of the rows of 
T2 and hence is smaller than the absolute sum of R2 over the infinite regular 
grid. Hence we have supXmax{^2) < +00. 



over 



Then, let R2 be the Fourier transform of i?2, and let Rmin > be its infimum 
Y- For all n, xi, Xn in the infinite regular grid, and for all e G M": 



7r 7r ]d 
' (5 ' 5 



e*r2e = ^ R2{xk - xi)ekei = ^ £^6; 



fc,/=l 



k,l=l 
2 



E 



fc=i 



E 

fc=i 



dt — Rr, 



/n 



fe.i=l 



Then, because J[_k ilj e^^^^dt = for a non-zero integer p, and because the Xk, 
1 < fc < n, are distinct and located on the regular grid, we have e**(="'=-^')rfi 
for /c ^ L Hence: 



/n 



27r 



'Ti fe=i 

which in turn implies < infXjnin{^2) and thus completes the proof. 
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